Robobench: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models as Embodied Brain

1 State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University,
2 Beijing Academy of Artificial Intelligence, 3 Institute for Brain and Intelligence, Fudan University,
4 University of Science and Technology Beijing, 5 Beijing Innovation Center of Humanoid Robotics

*Equal contribution, Corresponding author

Overview of RoboBench. RoboBench evaluates MLLMs as embodied brains across 5 dimensions, 14 subdimensions, and 25 tasks, with tasks color-coded by type (top left). These dimensions follow the embodied execution pipeline (bottom)—from understanding intent and perceiving the environment to planning and adapting actions, refining subgoals via affordances, and diagnosing failures—capturing the core cognitive roles of System 2. The performance comparison (top right) highlights significant gaps among state-of-the-art MLLMs, with Gemini-2.5-Pro achieving the best results.

Abstract

Building robots that can perceive, reason, and act in dynamic, unstructured environments remains a core challenge. Recent embodied systems often adopt a dual-system paradigm, where System 2 handles high-level reasoning while System 1 executes low-level control. In this work, we refer to System 2 as the embodied brain, emphasizing its role as the cognitive core for reasoning and decision-making in manipulation tasks. Given this role, systematic evaluation of the embodied brain is essential for advancing robotic intelligence. Yet existing benchmarks either emphasize execution success or, when targeting high-level reasoning, suffer from incomplete dimensions and limited task realism, offering only a partial picture of cognitive capability. To bridge this gap, we introduce RoboBench, a benchmark that systematically evaluates multimodal large language models (MLLMs) as embodied brains. Motivated by the embodied brain's critical roles across the full manipulation pipeline, RoboBench defines five dimensions—instruction comprehension, perception reasoning, generalized planning, affordance prediction, and failure analysis—spanning 14 capabilities, 25 tasks, and 6,092 QA pairs. To ensure realism, we curate datasets across diverse embodiments, attribute-rich objects, multi-view scenes, and memory-driven navigation, drawing from large-scale real robotic data and in-house collection. For planning, RoboBench introduces an evaluation framework that uses an MLLM as a world simulator. It moves beyond symbolic matching to evaluate embodied feasibility by simulating whether predicted plans can achieve critical object-state changes under physical and visual constraints, enabling faithful assessment of long-horizon reasoning. Experiments on 14 state-of-the-art MLLMs reveal fundamental limitations: difficulties with implicit instruction comprehension, spatiotemporal reasoning, cross-scenario planning, fine-grained affordance understanding, and execution failure diagnosis. RoboBench provides a comprehensive scaffold to quantify high-level cognition, clarify the role of the embodied brain, and guide the development of next-generation MLLMs toward more robust robotic intelligence.

News

🔥 2025.10.23 - Dataset released on Hugging Face! Check it out at https://huggingface.co/datasets/LeoFan01/RoboBench ❗️
🔥 2025.10.21 - The paper has been released! Code and dataset are being organized and will be released soon. Stay tuned❗️
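
For quick inspection, the released data can be pulled with the Hugging Face `datasets` library. The snippet below is a minimal sketch: the repo id comes from the link above, but the split name and field layout are assumptions, so check the dataset card for the actual configuration.

```python
# Minimal sketch: load RoboBench from the Hugging Face Hub.
# The split name "test" and the record layout are assumptions --
# consult the dataset card for the real configuration.
from datasets import load_dataset

robobench = load_dataset("LeoFan01/RoboBench", split="test")
print(robobench)       # number of rows and column names
print(robobench[0])    # inspect a single QA item
```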

Highlight

🔍 Benchmark Overview
  • The first comprehensive benchmark focused on evaluating MLLMs as embodied brains.
  • Systematic evaluation across 5 core dimensions, 14 capabilities, 25 task types, and 6,092 high-quality questions.
🧭 Comprehensive Dimensions
  • Covers key embodied skills tailored to MLLM capabilities, including instruction comprehension, perception reasoning, generalized planning, affordance prediction, and failure analysis in real-world settings.
🛡️ Robust Evaluation
  • All questions are manually verified for quality and consistency.
  • Long-horizon task planning is evaluated using a novel Directed Acyclic Graph (DAG)-guided approach to ensure rigor and robustness.
🧠 Real-world Data
  • Built on the latest open-source real-robot datasets and proprietary real-world data.
  • Evaluation tasks are grounded in realistic embodied interaction scenarios.
🌍 Diverse Composition
  • Sourced from a wide range of data and scenarios.
  • Captures the complexity and diversity of real-world embodied tasks.

Leaderboard

Perception Reasoning. Sub-dimensions are grouped as Robotic-centric (Robot-type, Robot-view), Object-centric (Static Attr., Functional Attr.), Scene-centric (Spatial Relation, Temp. Grounding), and Task-centric (Causality, Refer. Comprehen.).

| Model | Robot-type | Robot-view | Static Attr. | Functional Attr. | Spatial Relation | Temp. Grounding | Causality | Refer. Comprehen. | Avg |
|---|---|---|---|---|---|---|---|---|---|
| Basic Reference | | | | | | | | | |
| Human Evaluation | 80.67 | 79.08 | 43.77 | 83.89 | 70.91 | 51.61 | 91.22 | 93.22 | 74.30 |
| GPT-4o-text-only | 20.51 | 13.77 | 5.18 | 35.37 | 25.74 | 18.32 | 25.52 | 22.09 | 20.81 |
| Closed-Source MLLMs | | | | | | | | | |
| GPT-4o-Mini | 38.75 | 18.84 | 26.43 | 53.66 | 30.36 | 22.65 | 34.25 | 39.67 | 33.08 |
| GPT-4o | 64.96 | 39.38 | 24.92 | 46.75 | 42.24 | 20.61 | 33.10 | 41.31 | 39.16 |
| Claude-3.5-Sonnet | 41.31 | 36.23 | 29.13 | 62.60 | 34.98 | 21.88 | 36.09 | 25.36 | 35.95 |
| Claude-3.7-Sonnet | 40.46 | 32.37 | 45.20 | 71.14 | 36.63 | 21.09 | 40.92 | 28.02 | 39.48 |
| Gemini-2.0-Flash | 56.69 | 20.77 | 49.08 | 78.46 | 42.57 | 21.37 | 51.72 | 72.40 | 49.13 |
| Gemini-2.5-Flash | 62.39 | 39.38 | 55.02 | 77.24 | 57.43 | 33.58 | 70.34 | 74.64 | 58.75 |
| Gemini-2.5-Pro | 64.30 | 41.71 | 54.83 | 82.27 | 60.44 | 49.68 | 71.73 | 78.68 | 62.96 |
| Qwen-VL-Plus | 28.21 | 21.74 | 34.63 | 58.54 | 27.72 | 21.37 | 31.03 | 34.36 | 32.20 |
| Qwen-VL-Max | 47.86 | 43.48 | 39.70 | 75.20 | 50.17 | 27.45 | 37.93 | 41.53 | 45.42 |
| Open-Source Multi-Image MLLMs | | | | | | | | | |
| LLaVA-OneVision-0.5B | 30.34 | 23.68 | 37.08 | 49.66 | 27.27 | 18.42 | 23.65 | 19.21 | 28.66 |
| LLaVA-OneVision-7B | 44.83 | 30.26 | 33.43 | 75.84 | 45.45 | 23.68 | 25.68 | 44.63 | 40.48 |
| Qwen2.5-VL-7B-Ins | 23.93 | 26.81 | 37.86 | 46.34 | 31.68 | 22.90 | 14.48 | 36.81 | 30.10 |
| Qwen2.5-VL-72B-Ins | 47.72 | 42.75 | 41.74 | 72.95 | 48.51 | 27.87 | 40.32 | 42.13 | 45.50 |
| Embodied MLLMs | | | | | | | | | |
| RoboBrain-2.0-7B | 44.97 | 24.84 | 40.43 | 79.19 | 48.18 | 23.48 | 41.22 | 53.67 | 44.50 |

Instruction Comprehension and Generalized Planning. Planning columns are grouped as Cross-Embodiment (Single-arm, Dual-arm, Mobile-manip., Human), Cross-Object (Material Afford., Physical Attr., World Knowl.), Cross-View (Multi-view, Single-view), and Cross-Task (Navigation Plan.).

| Model | Explicit | Implicit | IC Avg | Single-arm | Dual-arm | Mobile-manip. | Human | Material Afford. | Physical Attr. | World Knowl. | Multi-view | Single-view | Navigation Plan. | Plan. Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Basic Reference | | | | | | | | | | | | | | |
| Human Evaluation | 59.94 | 61.13 | 60.54 | 72.50 | 41.93 | 41.55 | 62.28 | 56.70 | 58.98 | 49.36 | 52.82 | 51.59 | 45.23 | 54.50 |
| GPT-4o-text-only | 38.80 | 11.10 | 24.95 | 26.70 | 33.32 | 43.65 | 37.86 | 36.58 | 22.33 | 37.68 | 44.35 | 38.11 | 36.90 | 33.95 |
| Closed-Source MLLMs | | | | | | | | | | | | | | |
| GPT-4o-Mini | 41.21 | 14.95 | 28.08 | 27.47 | 25.21 | 37.98 | 31.72 | 33.75 | 38.46 | 42.56 | 39.11 | 33.29 | 34.04 | 33.31 |
| GPT-4o | 45.60 | 19.04 | 32.32 | 28.28 | 32.65 | 52.69 | 35.71 | 39.93 | 46.09 | 41.34 | 38.51 | 33.66 | 39.41 | 37.74 |
| Claude-3.5-Sonnet | 42.11 | 14.85 | 28.48 | 30.18 | 33.65 | 50.29 | 41.05 | 38.28 | 40.67 | 39.63 | 45.95 | 40.43 | 39.77 | 38.07 |
| Claude-3.7-Sonnet | 47.77 | 14.53 | 31.15 | 29.86 | 38.69 | 50.39 | 37.06 | 38.65 | 41.86 | 51.83 | 48.19 | 44.51 | 39.95 | 41.68 |
| Gemini-2.0-Flash | 43.49 | 16.38 | 29.93 | 28.67 | 33.66 | 48.27 | 33.95 | 40.76 | 54.27 | 40.12 | 46.13 | 40.73 | 37.02 | 38.62 |
| Gemini-2.5-Flash | 42.53 | 17.10 | 29.82 | 27.05 | 40.46 | 49.91 | 34.50 | 39.87 | 53.37 | 46.22 | 39.41 | 43.29 | 38.32 | 39.33 |
| Gemini-2.5-Pro | 51.15 | 19.60 | 35.37 | 29.71 | 37.65 | 50.96 | 37.44 | 39.29 | 56.50 | 43.29 | 47.35 | 45.12 | 43.62 | 41.81 |
| Qwen-VL-Plus | 37.77 | 10.38 | 24.07 | 24.68 | 21.75 | 32.98 | 33.91 | 28.45 | 33.55 | 33.78 | 30.95 | 28.60 | 4.39 | 26.77 |
| Qwen-VL-Max | 46.45 | 16.98 | 31.71 | 28.30 | 35.73 | 47.79 | 32.40 | 40.44 | 44.33 | 42.32 | 41.79 | 37.68 | – | 38.00 |
| Open-Source Multi-Image MLLMs | | | | | | | | | | | | | | |
| LLaVA-OneVision-0.5B | 6.82 | 1.24 | 3.61 | 2.90 | 4.57 | 4.77 | 3.68 | 4.77 | 3.47 | 6.47 | 4.30 | 3.62 | 11.39 | 4.83 |
| LLaVA-OneVision-7B | 18.93 | 3.48 | 10.05 | 11.48 | 16.23 | 8.27 | 5.34 | 18.51 | 15.62 | 8.10 | 0.00 | 15.16 | 24.67 | 12.15 |
| Qwen2.5-VL-7B-Ins | 26.45 | 4.65 | 15.55 | 19.47 | 12.90 | 28.75 | 28.19 | 22.06 | 21.63 | 25.61 | 11.79 | 20.12 | 2.10 | 18.64 |
| Qwen2.5-VL-72B-Ins | 46.81 | 15.15 | 30.98 | 28.20 | 36.92 | 49.14 | 31.31 | 40.51 | 44.94 | 38.90 | 43.16 | 40.24 | 37.47 | 37.73 |
| Embodied MLLMs | | | | | | | | | | | | | | |
| RoboBrain-2.0-7B | 36.93 | 8.19 | 22.51 | 15.46 | 25.32 | 32.72 | 31.81 | 19.85 | 30.85 | 23.24 | 31.51 | 23.89 | 24.53 | 25.35 |

Affordance Prediction and Failure Analysis.

| Model | Static | Dynamic | Navigation | Afford. Avg | Execution | Planning | Fail. Avg |
|---|---|---|---|---|---|---|---|
| Basic Reference | | | | | | | |
| Human Evaluation | 86.08 | 80.02 | 81.85 | 82.63 | 47.30 | 80.67 | 63.99 |
| GPT-4o-text-only | 44.89 | 40.70 | 38.19 | 39.88 | 25.17 | 37.93 | 31.55 |
| Closed-Source MLLMs | | | | | | | |
| GPT-4o-Mini | 50.64 | 42.88 | 42.30 | 46.39 | 17.66 | 44.60 | 31.13 |
| GPT-4o | 55.61 | 49.14 | 49.91 | 51.91 | 22.29 | 57.01 | 39.65 |
| Claude-3.5-Sonnet | 56.26 | 54.25 | 53.84 | 54.77 | 16.12 | 47.52 | 31.82 |
| Claude-3.7-Sonnet | 60.02 | 52.38 | 50.07 | 54.06 | 18.32 | 54.24 | 36.28 |
| Gemini-2.0-Flash | 61.65 | 61.76 | 66.89 | 63.37 | 28.48 | 59.80 | 44.14 |
| Gemini-2.5-Flash | 61.20 | 52.04 | 52.01 | 54.29 | 18.54 | 67.65 | 43.10 |
| Gemini-2.5-Pro | 70.54 | 62.03 | 63.96 | 65.21 | 15.96 | 74.31 | 45.14 |
| Qwen-VL-Plus | 51.74 | 37.42 | 47.97 | 48.18 | 13.91 | 40.00 | 26.96 |
| Qwen-VL-Max | 70.01 | 56.26 | 50.85 | 59.43 | 17.22 | 57.93 | 37.58 |
| Open-Source Multi-Image MLLMs | | | | | | | |
| LLaVA-OneVision-0.5B | 20.56 | 28.56 | 27.69 | 24.76 | 21.19 | 24.67 | 22.93 |
| LLaVA-OneVision-7B | 23.83 | 33.61 | 33.43 | 30.29 | 29.14 | 34.00 | 31.56 |
| Qwen2.5-VL-7B-Ins | 49.73 | 38.03 | 42.16 | 43.15 | 13.91 | 26.90 | 20.41 |
| Qwen2.5-VL-72B-Ins | 71.54 | 51.94 | 47.67 | 56.67 | 12.59 | 50.72 | 31.66 |
| Embodied MLLMs | | | | | | | |
| RoboBrain-2.0-7B | 51.87 | 54.63 | 41.61 | 49.37 | 7.95 | 42.00 | 41.24 |

Key Findings from RoboBench Evaluation

Overall Findings

🥇 Gemini-2.5-Pro Leads but Still Trails Humans

Gemini-2.5-Pro achieves the strongest overall performance across all five cognitive dimensions. It scores 62.96 in perception reasoning, 65.21 in affordance prediction, and 45.14 in failure analysis, well above other models but still far below the human reference (74.30 / 82.63 / 63.99). This underscores a persistent gap between current MLLMs and robust human-level embodied intelligence.

🔒 Closed-Source Models Still Hold the Advantage

Closed-source MLLMs outperform open-source ones in four of the five dimensions, often by 10–15%. Open-source models approach parity only in perception reasoning. Within each family, larger or newer models consistently perform better, e.g., GPT-4o > GPT-4o-Mini and Claude-3.7 > Claude-3.5.

🤖 Embodied Training Brings Noticeable Gains

The embodied MLLM RoboBrain-2.0-7B surpasses similarly sized general open-source models in perception reasoning, planning, and affordance prediction. This validates the effectiveness of domain-specific embodied datasets for improving multimodal reasoning and planning.

📊 Cognitive Difficulty Varies Across Dimensions

Perception reasoning yields the highest accuracies, while generalized planning remains the most challenging, exposing weaknesses in long-horizon reasoning and structured task decomposition. This contrast highlights where future progress is most needed.

Fine-grained Findings

🧠 Implicit Intent Understanding Remains a Major Challenge

Performance on implicit instructions drops by roughly 30 points compared to explicit ones. Models struggle to infer goals from indirect human demands, revealing weak integration of language, perception, and context.

👁️ Perception and Temporal Reasoning Bottlenecks

Models misidentify robot types and viewpoints and fail to localize events in time. Temporal and causal reasoning accuracies hover around 30–40%, except for the Gemini series. Stronger embodiment-aware perception and spatiotemporal reasoning modules are needed.

🧩 Planning Limitations Persist

  • Cross-embodiment: poor coordination in dual-arm and mobile manipulation.
  • Cross-object: difficulty with rare or knowledge-dependent objects.
  • Cross-view: multi-image inputs markedly improve performance (e.g., +5–7 points for GPT-4o / Claude-3.7), showing the promise of multi-view reasoning.

⚙️ Failure Analysis Is Extremely Hard

Diagnosing execution-level errors is far more difficult than planning-level ones (scores of roughly 10–20 vs. 40–60). It requires fine-grained spatial and physical understanding, e.g., distinguishing location errors from rotation errors. Even humans achieve only 47.3 on such tasks, underscoring their intrinsic complexity.

Dataset Construction Pipeline

Dataset Construction Pipeline. RoboBench integrates open-source and self-collected robot data under a shared process—preprocessing → tool-assisted and human-in-the-loop annotation → unified schema → auto-generated QA—and builds datasets for five dimensions:
  • Instruction Comprehension: pair explicit instructions with LLM-rewritten implicit variants to test intent understanding.
  • Perception Reasoning: use captioning/detection/segmentation tools to draft labels across robotic/object/scene/task views, then human-refine and standardize them.
  • Generalized Planning: construct a planning pool from robot videos; VLMs produce step/timestamp summaries and metadata, which are mapped to function templates to support the Q1/Q2/Q3 evaluations.
  • Affordance Prediction: sample key frames and annotate static (contact points), dynamic (trajectories), and mobile (base positions) affordances.
  • Failure Analysis: mine execution-level failures from real trials and synthesize planning-level errors by perturbing correct instructions.
All outputs follow one schema and are rendered into binary, single-choice, and multi-step multiple-choice QA formats for open- and closed-source MLLMs.
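
To make the "one schema, three QA formats" idea concrete, the sketch below shows what a unified record could look like. All field names are hypothetical illustrations, not the benchmark's actual schema.

```python
# Hypothetical sketch of a unified QA record; field names are illustrative only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class RoboBenchQA:
    dimension: str            # e.g., "generalized_planning" (one of the 5 dimensions)
    capability: str           # e.g., "cross_embodiment" (one of the 14 capabilities)
    task: str                 # one of the 25 task types
    question_type: str        # "binary" | "single_choice" | "multi_step_mcq"
    images: List[str]         # key frames or multi-view observations
    question: str             # rendered question text
    choices: List[str]        # answer options (e.g., ["yes", "no"] for binary items)
    answer: str               # gold label, e.g., "B" or an ordered step string
    meta: dict = field(default_factory=dict)   # embodiment, scene, source dataset, ...
```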

Planning Evaluation Pipeline

Planning Evaluation Framework. Evaluation of the planning dimension (Q1–Q3). Each task is decomposed into a sequence of parameterized atomic actions forming a Directed Acyclic Graph (DAG) that encodes causal and temporal dependencies.
  • Q1 (Long-horizon planning): an MLLM-based world simulator assesses both NodeCorrectness (action alignment) and TaskCompletion (goal-state achievement) by simulating action rollouts under visual and physical constraints.
  • Q2 (Next-step planning): evaluates fine-grained step prediction by comparing skill, object, and parameter accuracy.
  • Q3 (Task state estimation): measures binary correctness on whether a subtask has been completed.
Together, the pipeline provides a unified, interpretable framework for assessing structural correctness and embodied feasibility in planning.
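
As a rough illustration of how DAG-guided scoring could be wired up, the sketch below computes a NodeCorrectness-style score by matching predicted atomic actions against reference nodes whose dependencies are already satisfied, and delegates TaskCompletion to an MLLM world-simulator callable. The data structures, the matching rule, and the `mllm_simulates_goal` hook are simplified assumptions, not the paper's implementation.

```python
# Simplified sketch of DAG-guided scoring for Q1 (long-horizon planning).
# Data structures, the matching rule, and `mllm_simulates_goal` are
# illustrative assumptions, not RoboBench's actual implementation.
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

@dataclass(frozen=True)
class AtomicAction:
    skill: str                      # e.g., "pick"
    obj: str                        # e.g., "red mug"

@dataclass
class PlanDAG:
    nodes: Dict[str, AtomicAction]  # node id -> reference atomic action
    parents: Dict[str, Set[str]]    # node id -> ids it causally/temporally depends on

def node_correctness(pred: List[AtomicAction], ref: PlanDAG) -> float:
    """Fraction of reference nodes matched by a predicted action, counting a
    match only if all of the node's dependencies were matched earlier."""
    done: Set[str] = set()
    for act in pred:
        for nid, node in ref.nodes.items():
            if nid in done:
                continue
            if node == act and ref.parents.get(nid, set()) <= done:
                done.add(nid)
                break
    return len(done) / max(len(ref.nodes), 1)

def task_completion(pred: List[AtomicAction], goal_state: str,
                    mllm_simulates_goal: Callable[[List[AtomicAction], str], bool]) -> float:
    """Ask an MLLM world simulator whether rolling out the predicted plan
    reaches the critical object-state changes described by `goal_state`."""
    return 1.0 if mllm_simulates_goal(pred, goal_state) else 0.0

# Toy usage with hypothetical actions and a stubbed simulator:
ref = PlanDAG(
    nodes={"n1": AtomicAction("pick", "cup"), "n2": AtomicAction("place", "cup")},
    parents={"n1": set(), "n2": {"n1"}},
)
pred = [AtomicAction("pick", "cup"), AtomicAction("place", "cup")]
print(node_correctness(pred, ref))                                    # -> 1.0
print(task_completion(pred, "cup on the shelf", lambda p, g: True))   # -> 1.0
```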

Demo Case

BibTeX

@misc{luo2025robobenchcomprehensiveevaluationbenchmark,
  title={Robobench: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models as Embodied Brain},
  author={Yulin Luo and Chun-Kai Fan and Menghang Dong and Jiayu Shi and Mengdi Zhao and Bo-Wen Zhang and Cheng Chi and Jiaming Liu and Gaole Dai and Rongyu Zhang and Ruichuan An and Kun Wu and Zhengping Che and Shaoxuan Xie and Guocai Yao and Zhongxia Zhao and Pengwei Wang and Guang Liu and Zhongyuan Wang and Tiejun Huang and Shanghang Zhang},
  year={2025},
  eprint={2510.17801},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2510.17801},
}